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Abstract 

The Planctomycetes, Verrucomicrobia, Chlamydiae (PVC) super-phylum contains bacteria with either complex cellular organization or 
simple cell structure; it also includes organisms of different lifestyles (pathogens, mutualists, commensal, and free-living). Genome 
content evolution of this group has not been studied in a systematic fashion, which would reveal genes underlying the emergence of 
PVC-specific phenotypes. Here, we analyzed the evolutionary dynamics of 26 PVC genomes and several outgroup species. We 
inferred HGT, duplications, and losses by reconciliation of 27,1 23 gene trees with the species phylogeny. We showed that genome 
expansion and contraction have driven evolution within Planctomycetes and Chlamydiae, respectively, and balanced each other in 
Verrucomicrobia and Lentisphaerae. We also found that for a large number of genes in PVC genomes the most similar sequences are 
present InAcidobacteria, suggesting past and/or current ecological interaction between organisms from these groups. We also found 
evidence of shared ancestry between carbohydrate degradation genes in the mucin-degrading human intestinal commensal 
Akkermansia muciniphila and sequences from Acidobacteria and Bacteroidetes, suggesting that glycoside hydrolases are transferred 
laterally between gut microbes and that the process of carbohydrate degradation is crucial for microbial survival within the human 
digestive system. Further, we identified a highly conserved genetic module preferentially present in compartmentalized PVC species 
and possibly associated with the complex cell plan in these organisms. This conserved machinery is likely to be membrane targeted and 
involved in electron transport, although its exact function is unknown. These genes represent good candidates for future functional 
studies. 

Key words: genome evolution in PVC super-phylum, cellular compartmentalization, DUF1501 and P/ancto/r?ycefes-specific 
cytochromes, mucin-degradation by Akkermansia muciniphila. 



Introduction 

The Planctomycetes, Verrucomicrobia, Chlamydiae (PVC) 
super-phylum is a group of six bacterial phyla: Planctomycetes, 
Verrucomicrobia, Chlamydiae, Lentisphaerae, Poribacteria, 
and OP3 (Wagner and Horn 2006). Some bacteria within 
this group exhibit complex eukaryote-like cellular compart- 
mentalization, whereas others feature simple cell organiza- 
tion. All Chlamydiae have the simple cell structure common 
among Gram-negative bacteria. The cellular structure of OP3 
species and a number of Verrucomicrobia and Lentisphaerae 
organisms, with regard to the presence or absence of 



intracellular membranes, remains unknown. However, all 
characterized Planctomycetes (Fuerst 2005), several Verruco- 
microbia (Lee et al. 2009), one lentisphaerae species (Fuerst 
JA, personal communication), and the poorly characterized 
poribacteria (Fieseler et al. 2004) have a common cell plan 
that features an additional intracellular membrane and is 
unique to these bacteria. Planctomycetes also exhibit vari- 
ations upon this common plan, featuring additional mem- 
brane-enclosed compartments of known or undetermined 
function (Fuerst 2005). These properties make PVC an attract- 
ive group for studying the evolution of biological complexity in 
bacteria. 
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Additionally, the PVC group provides an opportunity to 
study the evolution of bacterial lifestyles. There are four 
main types of lifestyle in bacteria and all of them can be 
found within the PVC super-phylum: free-living organisms 
lacking interaction with a eukaryotic host; mutualists — inter- 
action occurs and both host and symbiont benefit from the 
interaction; commensals — interaction occurs but only the 
symbiont benefits; pathogens — interaction occurs and sym- 
biont benefits from the interaction, whereas it has negative 
consequences on the host. The lifestyle of PVC organisms 
range from free-living soil and aquatic Planctomycetes, 
Verrucomicrobia, and Lentisphaerae, through commensal 
and mutualistic microbes of Verrucomicrobia and Lenti- 
sphaerae, to obligate pathogens of the phylum Chlamydiae 
(Balows et al. 1992) (supplementary table S1, Supplementary 
Material online). Phylum Chlamydiae includes intracellular 
pathogens causing diseases in human and animals. Various 
amoebae are recognized as a natural reservoir (Horn 2008) for 
chlamydial species, although in some cases the nature of the 
ecological relationship between amoebae and Chlamydiae is 
unclear. It has been shown that the presence of certain chla- 
mydial symbionts increases the cytopathic effects of amoebae 
in their hosts, suggesting mutualistic interactions (Fritsche 
et al. 1998). In other cases, the presence of Chlamydiae has 
no effect on the amoebae, Chlamydiae do not become estab- 
lished in the amoebae, or chlamydial presence leads to death 
of the amoebae, which suggests commensal, transient, or 
pathogenic interactions, respectively (Gautom and Fritsche 
1995). Phylum Verrucomicrobia contains the human intestinal 
commensal species Akkermansia muciniphila. Mechanisms 
underlying the A. muciniphila-human interaction are poorly 
understood (Derrien et al. 2011). Therefore, elucidation of 
lineage-specific evolutionary processes associated with the dif- 
ferent PVC lineages could shed light on the emergence of 
pathogens and beneficial microbes, as well as the genetic 
basis of complex cell organization in PVC bacteria. 

Specific organismal properties are often determined by the 
effects of small-scale evolutionary events (nucleotide/amino 
acid replacement and indel substitutions) on gene functions 
(Gu 1 999; Knudsen and Miyamoto 2001 ; Britten 2002; Davids 
et al. 2002; Anisimova and Liberies 2007; Kamneva et al. 
2010). However, the gene content of bacterial genomes can 
vary significantly even between closely related bacterial spe- 
cies, in terms of both the presence/absence and the size of 
particular gene families (Lefebure and Stanhope 2007). These 
larger scale differences can be attributed to horizontal gene 
transfer, gene duplication, emergence of novel protein-coding 
sequences from non-coding genomic regions, and subse- 
quent loss of acquired genes due to either effects on fitness 
or stochastic processes. These genome content changes con- 
tribute to the emergence of different ecological and physio- 
logical properties in microorganisms (Ochman et al. 2000; 
Mira et al. 2001). Within the PVC super-phylum, acquisition 
of "membrane coat-like" proteins on several lineages has 



been proposed as a key evolutionary event leading to the 
complex cell plan in PVC bacteria (Santarella-Mellwig et al. 
201 0). Therefore, reconstruction of genome content evolution 
is fundamental for understanding the natural history of bac- 
terial species and to study evolution of new traits in microbial 
genomes. 

Although genome information is currently available for 20 
species of Planctomycetes, Verrucomicrobia, and Lenti- 
sphaerae and for multiple chlamydial species (five genera), a 
comprehensive analysis of PVC genome dynamics has not 
been reported. This prompted the study described later. A 
number of methods are currently available for studying 
genome-level evolutionary change. Some of them operate 
on phyletic patterns (presence/absence of genes across gen- 
omes), ignoring evolutionary relationships between genes in 
gene families (CsCios 2010). This is particularly problematic 
when analyzing genomes of distantly related organisms, be- 
cause a high degree of divergence over long periods of evo- 
lutionary time can obscure many duplications, losses, and 
transfers, and they will not be reflected in the phyletic profile. 
The recently described AnGST program (David and Aim 201 0) 
performs a parsimony-based gene-tree species-tree reconcili- 
ation which takes gene phylogeny into account. This tool can 
be applied to a genome-scale data set. It also incorporates 
evolutionary events relevant for bacterial genomes: gene du- 
plication, loss, and HGT. The method implemented in AnGST 
is also somewhat robust to gene phylogeny reconstruction 
errors through bootstrap tree amalgamation, the procedure 
of resolving uncertainties by incorporating reconciliation into 
the tree-building process. The gene tree with the lowest rec- 
onciliation cost is chosen from a collection of trees consistent 
with the set of bipartitions present in the input gene trees 
(David and Aim 2010). One main disadvantage of this ap- 
proach is that it assumes a tree-like relationship between gen- 
omes in the data set. The problems associated with this 
assumption are discussed in greater details in the Results 
and Discussion section, although it presents a background 
tree on which other events can be mapped. There are a few 
further disadvantages with use of AnGST: 1) It requires a 
bifurcating species tree, which can be hard to obtain for 
large sets of distantly related organisms. There is a gene-tree 
species-tree reconciliation tool which can be used with a mul- 
tifurcating species tree (Berglund-Sonnhammer et al. 2006), 
but it treats only duplication and loss, but not horizontal trans- 
fer. 2) AnGST employs parsimony to evaluate the inferred 
gene history; therefore, it requires somewhat arbitrary 
choice of event penalties and ignores branch lengths of the 
species tree. However, the only method allowing inference of 
a detailed gene history through species-tree guided gene-tree 
reconstruction implemented in a reliable statistical framework 
does not incorporate HGT events, which are prominent fea- 
tures of bacterial evolution (Akerborg et al. 2009). Consider- 
ing our need to account for gene-tree topology for distantly 
related species, together with the relevance of HGT events for 
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bacterial evolution, we therefore decided to use AnGST to 
study the evolution of PVC genomes. We inferred rates and 
patterns of gene birth, duplication, loss, and lateral transfer in 
PVC super-phylum evolution, and report the results of this 
large-scale analysis of gene family dynamics and lineage- 
specific patterns. To obtain insights into the evolution of 
host association in PVC bacteria, we examined the taxo- 
nomic affiliation of sequences similar to genes involved in 
carbohydrate metabolism in the A. muciniphila genome, 
examining the topology of every glycoside hydrolase and 
glycoside transferase-specific gene family. We found that 
the sequences most closely related to A. muciniphila 
genes that are potentially involved in mucin-degradation 
are from the genomes of various Bacteroidetes and Acido- 
bacteria species. We also discovered a highly conserved 
genetic module preferentially present in PVC species pos- 
sessing intracellular membranes, and possibly associated 
with the complex cell plan. The key protein of the module 
contains a DUF1 501 domain and appears to have emerged 
on the PVC lineage. The presence of signal peptides and 
twin-arginine motifs in the module's proteins, as well as 
putative cytochrome-specific signatures, suggests involve- 
ment with electron transport and membrane targeting, al- 
though the exact function/localization is unknown. These 
genes represent promising candidates for future functional 
studies, enabled by recent advances in genetic tool devel- 
opment for PVC organisms (Domman et al. 2011; Jogler 
etal. 2011). 

Materials and Methods 

Sequence Data 

To assemble a genome data set, we collected 26 sequenced 
genomes for organisms of the PVC super-phylum (all available 
genomes of Planctomycetes, Verrucomicrobia, and Lenti- 
sphaerae and six genomes of species representing five different 
chlamydial genera). We also included 73 representatives of 
other bacterial phyla with completely sequenced genomes, 
to obtain information on the evolutionary origin of genes 
in PVC genomes. The final list of 99 analyzed species is 
shown in supplementary table S1, Supplementary Material 
online. Protein sequences for genomes of interest were down- 
loaded from GenBank (Benson et al. 2005). Taxonomic and 
ecological information for every organism in the data set was 
extracted from the GOLD database (Pagani et al. 2012). 
Phylum-level taxonomic units were used as taxon identifiers 
(supplementary table S1, Supplementary Material online). We 
included selected bacteria distantly related to the PVC 
super-phylum, to represent as much sequence diversity as pos- 
sible within the constraints of computational time. Included in 
the data set were 35 host-associated organisms (pathogens, 
mutualists, or commensals) as well as 64 free-living soil and 



aquatic bacteria, to represent organisms with different ecolo- 
gical associations. 

Gene Families, Alignments, and Gene Trees 

Gene families were identified using OrthoMCL 1.4 (Li et al. 
2003), with an inflation value of 1.5 and the threshold for 
expectation value for BLAST (Altschul et al. 1990) search set 
at 1 x 10 -4 Protein sequences were aligned using MUSCLE 
v3.7 (Edgar 2004) with default settings. Phylogenetic trees 
were reconstructed using PhyML (Guindon and Gascuel 
2003) implementing the WAG + I + GAMMA evolutionary 
model (Reeves 1992; Yang 1993; Whelan and Goldman 
2001), collecting five bootstrap samples for the bootstrap 
amalgamation procedure linked to gene-tree species-tree rec- 
onciliation. The model was selected as the best model for 44 
gene families, tested independently (discussed later). It was 
applied to all 10,296 gene families, even those not specifically 
tested. Gene phylogeny for DUF1 501 -carrying proteins was 
reconstructed using PhyML and the WAG + I + GAMMA evo- 
lutionary model; however, it was not tested with bootstrap 
analysis. Domain architecture of genomic loci containing 
DUF1 501 -carrying proteins was visualized using the iTOL 
web-server (Letunic and Bork 201 1). 

Species Tree Reconstruction and Molecular Dating 

A species tree topology was recovered for all 99 organisms in 
the data set. Forty-four gene families, represented by exactly 
one gene in every genome under consideration, were used as 
a set of phylogenetic markers. The evolutionary model of best 
fit was determined for every phylogenetic marker using 
ProtTest (Abascal et al. 2005). We found WAG + I + 
GAMMA to be the best-supported general model of evolution 
for all 44 groups. Phylogeny reconstruction was subsequently 
performed for every gene family using PhyML and the 
WAG + I + GAMMA evolutionary model, performing 20 boot- 
strap runs. The trees resulting from every bootstrap run, for 
every phylogenetic marker, were summarized using the con- 
sense program in the PHYLIP package (Felsenstein 1989) with 
default settings (extended majority rule). We found one fully 
resolved topology to be supported by the data set. This tree 
was rooted using the common ancestor of phyla Firmicutes 
and Tenericutes as an outgroup because this node of bacterial 
phylogeny was reported to be the most ancestral node in 
previous studies (Ciccarelli et al. 2006; David and Aim 2010) 
(supplementary fig. S1, Supplementary Material online). 

Species tree topology was used to determine divergence 
times using PhyloBayes (3.3b) (Lartillot et al. 2009). We con- 
ducted two PhyloBayes runs with a CIR process model of rate 
correlation and five sets of temporal constraints that could be 
directly linked to fossil or geochemical evidence, or were pre- 
viously reported in the literature (supplementary fig. S1 and 
table S2, Supplementary Material online). We collected 
10,000 samples from each chain. Samples from every chain 
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were summarized with 2,000 sample points discarded as a 
burn-in and only including every 10th sample value to 
reduce autocorrelation between samples. Both chains led to 
similar results (supplementary fig. S2, Supplementary Material 
online). Final divergence time estimates were obtained from 
one chain following a burn-in of 2,000 cycles, after which 
trees were sampled every 10 cycles until the 10,000th cycle. 
Credible intervals of divergence time for every species tree 
node over the final chronogram are shown in supplementary 
figure S3, Supplementary Material online. We also used a 
concatenated alignment of 44 gene families represented by 
one gene in every genome in the data set, to test whether the 
data supported a phylogenetic network, using the Neighbour- 
Net algorithm implemented in the SplitsTree software pack- 
age (Huson and Bryant 2006). 



difference in genome size between ancestor and descendant 
across the branches of the species tree (genome flux): 

/=196 / T 

E V(P GS /- cGS ') 

n _ /=1 



where pGS,- and cGS/ denote ancestral (parent) and descend- 
ant (child) genome size for branch /', and 1 96 is the number of 
branches in the species tree. 

We included all gene families containing two or more se- 
quences in the genome flux analysis and used loss and speci- 
ation costs fixed at 1 .0 and 0.0, respectively. Minimal genome 
flux was obtained with HGT penalty equal to 5, and duplica- 
tion penalty equal to 3 (supplementary fig. S4, Supplementary 
Material online). 



Ancestral Character Reconstruction 

Ancestral state reconstruction for presence/absence of intra- 
cellular membranes in bacteria included in our data set was 
performed using Mesquite 2.75 (http://mesquiteproject.org, 
last accessed December 13, 2012), implementing the model 
of discrete character evolution with two parameters (character 
gain and loss rates) (Pagel 1999). We used the presence/ab- 
sence of intracellular membranes as described in supplemen- 
tary table S1, Supplementary Material online, and the species 
tree (supplementary fig. S1, Supplementary Material online). 
Mesquite ignores taxa for which the morphological character 
is not provided (species for which intracellular membrane pres- 
ence/absence is unknown) and corresponding ancestral 
nodes. We inferred maximum likelihood estimates for gain 
and loss rates, the likelihood of the character distribution 
given the species tree and proportional likelihoods of both 
states (compartmentalized/uncompartmentalized cell) for 
every node of the species tree, except for those which lead 
to nodes with unknown cellular structure (supplementary fig. 
S10, Supplementary Material online). 

Analysis of Genome Content Evolution 

Genome content evolution was inferred using parsimonious 
gene-tree species-tree reconciliation as implemented in the 
AnGST program (David and Aim 2010). The most parsimoni- 
ous reconciliation was inferred for a gene tree of every gene 
family containing two or more sequences, using event penalty 
values obtained as described in the section later. We did not 
enforce time-consistent reconciliation, allowing transfer to 
occur between any two lineages. The numbers of gain, loss, 
and transfer events were extracted from AnGST output for 
every gene family for every species tree lineage. 

Genome Flux Analysis 

We employed a method originally reported in (David and Aim 
2010) to infer event penalties leading to minimal average 



Analysis of Gene Loss, Duplication, and Transfer Rates 

Although the most parsimonious gene-tree species-tree rec- 
onciliation provides direct assessment of a number of events 
(duplication, losses, and HGT) on every lineage of the species 
tree, it does not infer rates of those events. Here, we esti- 
mated rates of gene loss, duplication, and horizontal gene 
transfer across the lineages of the species tree in the following 
fashion: 

RI/ = NI / /(^pGS / ); 
Rd / = Nd / /(L / ? c pGS / ); 

Rt/ = Nt///./; 
Rb/ = Nbj/Lr, 

where RI /# Rd,-, Rt/, and Rb/, are the rate of losses, duplications, 
transfer, and gene birth events, respectively, occurred on the 
branch /; NI /# Nd /# Nt/, and Nb/ are the number of losses, du- 
plications, transfer, and gene birth events, respectively, 
occurred on the branch /. /.,- is the branch length of branch / 
in million years for transfer and birth rate and in thousand 
years for loss and duplication; pGS/ is the ancestral (parent) 
genome size for branch /'. 

Distribution of gene loss, duplication, and birth rates across 
the lineages of the species tree (after removing outliers and 
branches having rate equal to zero) and log transformation 
were modeled in a Bayesian framework as a normal 
distribution: 

Robs/|/x, r ~ Norm(/x, r); 

Using non-informative independence priors over mean ([i) and 
precision (x); 

/x~ Norm(0.0, 1e-6); 
r-7(0.001,0.001); 

In the case of transfer rates, data (after removing outliers 
and branches having rates equal to zero and log 
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transformation) were modeled in a Bayesian framework as a 
mixture of two normal distributions: 

Robs, | NormO^r-i), 
+7T2Norm(/x 2 , r 2 ); 

Using noninformative independence priors over means 
precisions (x y ), and weights (Try): 

/x y -Norm(0.0,1e-6); 

r y S 7(0.001, 0.001); 
(7t^7t 2 ) ~ Dirchlet(1, 1); 

Predictive values were sampled from the posterior predict- 
ive distribution and compared with observed data; average 
differences between observed and predicted values were 
used to assess every model: 

Rpred/l/x, r~ Norm(/x, r)for loss duplication and birth rates or 

Rpred/l/x-i NormOx^ti) 

+7T2Norm(/x2, t 2 )for transfer rate; 

Dj = Robs/ - Rpred,- 
Dav = mean(D/) 

QQ plots were also examined to see whether observed and 
the predicted data were distributed in the same way. 

The model was implemented in JAGS-3.2.0 using the rjags 
R package interface. Three chains of MCMC were run for 
25,000 generations (for loss, duplication, transfer, and birth 
rates) sampling from the posterior distributions of the param- 
eters: [i, x, |iy, xy, rry Rpred/, D„ and Dav. Only every 5th sample 
was retained to reduce autocorrelation between samples and 
the initial 5,000 samples were discarded as a burn-in. Conver- 
gence of the chains was assessed using trace plots. Technical 
results of the modeling protocol are summarized in supple- 
mentary figures S5-S8, Supplementary Material online. 

D/ values were also used to identify extreme data values. If 
85% or more, D,- values for a particular / were smaller or larger 
than zero, we classified branch / as a branch with decelerated 
or accelerated rate of events (losses, duplications, transfers, 
and birth events). 

Analysis of Donor Lineages in Transfer Events 

We used a binomial test to detect statistical over- 
representation of various evolutionary lineages among those 
acting as donors in HGT events, assuming each transferred 
gene to be an independent event and all organisms to have 
equal ecological opportunity to transfer genes. The probability 
of an organism acting as an HGT donor was assumed to be 
proportional to genome size. We transformed the P values 
obtained for every donor lineage to generate heatmap 
charts. The transformation scales all P values to range from 
-1 to 1, where values approaching -1 correspond to 



underrepresented lineages, and values approaching 1 corres- 
pond to overrepresented lineages. Heatmap charts were gen- 
erated using the gplots R package. 

Phyletic Profiling 

We employed phyletic profiling to search for gene families 
exhibiting profiles correlated with the presence/absence of 
intracellular membranes. We excluded PVC organisms with 
unknown cellular structures from this analysis (supplementary 
table S1, Supplementary Material online). We used Pearson 
correlation coefficients as a measure of similarity between 
membrane presence/absence and phyletic profile for every 
gene family, and retrieved protein groups with correlation co- 
efficients above 0.8. We also used a BLASTP search with 
Gemmata obscuriglobus protein gi: 168700061 to charac- 
terize gene families with similarity to membrane coat-like 
proteins. 

Analysis of Biological Functions 

To link gene families to metabolic pathways, proteins (from 
one representative organism of every species with a KEGG- 
annotated genome) and their cellular pathway annotations 
were retrieved from the KEGG database. Gene families 
were assigned to pathways based on either existing KEGG 
annotation for a gene (for those genomes which were anno- 
tated in KEGG) or annotation of BLAST best hit for the genes 
of a gene family. If at least half of the genes from a gene 
family were assigned to one particular pathway, the entire 
protein family was assigned to this biological pathway. We 
employed a binomial test to determine statistical over- and 
underrepresentation of cellular pathways among the gene 
families affected by gene birth, duplication, transfer, and 
loss processes on certain evolutionary lineages. We trans- 
formed the P values obtained for every pathway to generate 
heatmap charts. The transformation scales all P values to 
range from -1 to 1, where values approaching -1 corres- 
pond to underrepresented pathways, and values approaching 
1 correspond to overrepresented pathways. 

The CAZY-based classification of carbohydrate metabo- 
lism-related genes from the A. muciniphila genome was 
downloaded from http://www.cazy.org (last accessed 
December 13, 2012). 

Heatmap charts were generated using the gplots R 
package. 

Results and Discussion 

Species Tree 

To characterize genome content evolution in the PVC 
super-phylum, we assembled a set of all available finished or 
draft genomes of organisms from phyla Planctomycetes, 
Verrucomicrobia, and Lentisphaerae (20 genomes) and six 
genomes of organisms belonging to phylum Chlamydiae. 
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We also included 73 species representing other bacterial 
phyla. We reconstructed a background "species tree" for 
the entire data set of 99 genomes using a set of phylogenetic 
markers present in every genome in one copy. The recovered 
species tree topology (fig. 1 and supplementary fig. S1, 
Supplementary Material online) was largely similar to species 
relationships reported in previous studies (Wagner and Horn 
2006; Hou et al. 2008; Pilhofer et al. 2008; Kamneva et al. 
2010). The presence of non-PVC species in the data set 
allowed us to identify a sister clade of the PVC super-phylum, 
containing phyla Spirochaetes, Bacteroidetes, and Chlorobi 
(fig. 1 and supplementary fig. S1, Supplementary Material 
online). This relationship was also recovered in previous studies 
conducted using a different set of species and phylogenetic 
markers (Hou etal. 2008). 

Although we were able to recover a species tree, we also 
detected a strong signal for a phylogenetic network at the 
root of the tree (supplementary fig. S9, Supplementary 
Material online). This highlights the role of HGT in bacterial 
evolution, and supports the current view on evolution of living 
organisms which incorporates not only the signal of vertically 
inherited genetic information (unique ancestor/descendant re- 
lationship), but also the signal of lateral transfer events (several 
ancestors for an organism) (Mclnerney et al. 201 1). Although 
significant empirical evidence of a network-like evolutionary 



history has accumulated for both pro- and eukaryotic organ- 
isms (Fitzpatrick et al. 2006; Sullivan et al. 2006; McDonald 
et al. 2008; Retchless and Lawrence 2010), the majority of 
methods evaluating gene history assume tree-like evolutionary 
relationships between genomes/organisms, as a background 
on which horizontal processes can occur. Several factors com- 
plicate inference of tree-like species relationships in bacteria. 
The vague definition of bacterial species arises from the frag- 
mented nature of bacterial speciation and poorly defined spe- 
cies boundaries (Fraser et al. 2007; Retchless and Lawrence 
2010). In the early stages of speciation, genetic isolation is 
normally established at genomic loci containing niche-specific 
genes, whereas recombination persists in many other regions 
of the genome (including those encoding genes that are 
highly connected in the genetic network and resistant to 
HGT, such as information processing genes) (Retchless and 
Lawrence 2007). However, after genetic isolation has been 
established for major parts of the genome, illegitimate recom- 
bination, or transfer events facilitated by various mechanisms 
may still be a source of genetic material from distantly or 
closely related organisms. At this stage, genes with low con- 
nectivity in the genetic network such as those encoding anti- 
biotic resistance, photosynthesis, or other metabolism-related 
genes are transferred (Rivera et al. 1998; Cohen et al. 2010). 
Multiple genetic markers are often used to analyze 
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Fig. 1 . — Evolutionary relationships between PVC organisms included in the analysis. Species tree topology was recovered for the entire set of 99 bacterial 
species from various bacterial phyla, as a consensus tree averaging over gene trees of 44 phylogenetic markers. Divergence times were estimated using a 
concatenated alignment of all 44 phylogenetic markers. Non-PVC clades were collapsed for clarity and corresponding lineages are shown in gray. Names of 
PVC phyla are shown on the right. The full species tree is presented in supplementary figure S1, Supplementary Material online. 
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phylogenetic relationships between bacterial species. Patchi- 
ness in recombination and frequent HGT events are often 
apparent when clear phylogenetic separation or coherence 
of species can be established using one set of markers, 
whereas other sets support alternate groupings (Hanage 
etal. 2005; Fraser et al. 2007). 

Evolution of the Compartmentalized Cell Plan in the PVC 
Super-Phylum 

One of the questions motivating the study of the evolutionary 
history of the PVC super-phylum is the origin of PVC-specific 
intracellular membranes. The lineage or lineages where the 
genetic determinants of PVC-specific cellular structures were 
acquired is unknown. To address this question, we performed 
ancestral character reconstruction using the model described 
in (Pagel 1999). We inferred the presence/absence of intracel- 
lular membranes in ancestral species corresponding to the 
nodes of the species tree (supplementary fig. S10, Supplemen- 
tary Material online) using data on the presence/absence of 
intracellular membranes obtained from published studies 
(Fuerst 2005; Lee et al. 2009, Fuerst JA, personal communi- 
cation) (supplementary table S1, Supplementary Material 
online). According to our inference, the compartmentalized 
cell plan is likely to have initially emerged on the PVC phylo- 
genetic lineage; therefore, the genes responsible for the de- 
velopment of cellular compartmentalization should also have 
been first acquired/emerged on this phylogenetic lineage 
within the super-phylum. It is also possible that the compart- 
mentalized cell plan was reinvented on some internal lineages 
of Verrucomicrobia- and lentisphaerae-specific clades, but 
additional microscopy experiments are needed to elucidate if 
and when that happened. 

Gene Family Dynamics in the PVC Super-Phylum 

We employed OrthoMCL clustering to derive putative gene 
families. We identified 17,608 homologous families contain- 
ing at least three sequences, 9,444 containing two members, 
and 63,679 singletons. To assess gene family dynamics, we 
reconciled every gene tree with the species tree using the 
parsimony-based method reported in (David and Aim 2010). 
The algorithm uses HGT, gene duplication, and gene loss 
events to explain discrepancies between gene and species 
phylogeny. Every event is associated with a penalty, from 
which the algorithm then infers reconciliation of minimal 
cost. The use of varying event penalties within a parsimony- 
based gene-tree species-tree reconciliation procedure allows 
the recovery of every possible reconciliation as the most par- 
simonious one. Therefore, careful justification of event penal- 
ties is required. We conducted a search for HGT and 
duplication cost values which lead to minimum average 
genome flux (the difference between the ancestor and des- 
cendant genome sizes over every lineage of the species tree). 
We used genome flux as an external criterion to evaluate the 



accuracy of every inferred evolutionary scenario of genome- 
level changes. The main assumption of this step was that the 
genome sizes of the ancestor and descendant are correlated. 
This method was also employed in the original article report- 
ing use of the AnGST tool (David and Aim 2010). Although 
genome sequences are expected to harbor segregating vari- 
ations along with sites invariant across the entire population, 
we treated all genomes as if they represent a genome of the 
entire species. This was due to the lack of population data for 
the majority of species under consideration, and was thus a 
necessary assumption of this analysis. 

Gene-tree species-tree reconciliation allows explicit infer- 
ence of the evolutionary history of every gene family in the 
data set, in terms of gene family origins, gene duplication, 
loss, and transfer events. It also allows evaluation of 
genome size for every ancestral genome. We additionally con- 
ducted Bayesian statistical modeling of evolutionary rates (see 
Materials and Methods section for details) and identified 
branches with accelerated and decelerated rates of gene 
loss, duplication, birth, and acquisition via HGT. The summar- 
ized results of this analysis for the PVC clade are depicted in 
figure 2; data from the full data set are illustrated in supple- 
mentary figure S1 1, Supplementary Material online. Here, we 
identified potential HGT events based on gene-tree 
species-tree reconciliation. However, this assertion should be 
viewed as "the best guess" for the origin of the sequence in 
the genome considering the limited number of extant lineages 
included in the data set, as well as the mosaic nature of the 
majority of studied genomes (Mclnerney et al. 201 1). 

Our results suggest that the common ancestor of all PVC 
organisms had a genome containing 3,106 genes, of which 
786 were members of multigene families. Three hundred 
fifty-nine gene families emerged on the lineage leading to 
the common ancestor of all PVC organisms. Only a small pro- 
portion (7) of these 359 families encodes membrane coat-like 
proteins; therefore, other gene families acquired on this lin- 
eage could contain previously uncharacterized determinants 
of the PVC-specific cell plan. Two hundred twenty-nine gene 
families were acquired via HGT on the lineage leading to the 
last common ancestor of all PVC species. 

After the Planctomycetes split from the rest of the 
superphylum, lineages leading to Rhodopirellula baltica, 
Blastopirellula marina, Planctomyces maris, and the common 
ancestor of Planctomyces and Pirellula species underwent 
birth of gene families. We also observed elevated gene birth 
rate, and expansion of existing families through gene duplica- 
tion, on the lineage leading to G. obscuriglobus, with large 
numbers of genes in multigene families (2,840 genes out of 
7,989). This suggests that evolution on the G. obscuriglobus 
lineage has been affected by changes after gene duplication, 
although we did not test this hypothesis; additional factors 
affecting duplicate retention include dosage balance, when 
the gene is retained due to required stoichiometric balance 
with other proteins in the system, and selection on abundance 
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Fig. 2. — Genome content evolution in PVC super-phylum. Events of genome content evolution were mapped onto lineages of the species tree; only the 
PVC clade is shown here. Numbers at every node, either ancestral or extant, represent genome size and number of genes in multigene families (for instance 
the Isosphaera pallida genome contains 3,722 genes, out of which 604 genes are predicted to be members of multigene families). Numbers above and 
below every lineage represent the number of birth/duplication/transfer and loss events, respectively, predicted to occur on the branch. Numbers shown in red 
or blue correspond to accelerated or decelerated rates of events on the branch, classified as described in Materials and Methods section (for instance on the 
lineage leading to Verrucomicrobium spinosum 1,761 births, 646 duplications, 1,292 transfers, and 265 loss events occurred. This observed event count 
implies elevated gene birth and duplication rates on this lineage and low gene loss rate). 



of gene product of nondivergent duplicates (Innan and 
Kondrashov 2010; Dittmar and Liberies 2011). The ancestor 
of Verrucomicrobia, Chlamydiae, and Lentisphaerae possessed 
a relatively large genome containing 2,557 genes. This ances- 
tral gene set was shaped primarily by genome shrinkage on 
the lineage leading to the ancestor of all the Chlamydiae, and 
within the chlamydial clade. We did not detect elevated rates 
of gene loss in the chlamydial clade, but rather found evidence 
for low rates of gene family acquisition through various mech- 
anisms. Generally, low rates of gene birth, duplication, and 
transfer were detected on several lineages of the chlamydial 
clade, especially on the lineage leading to the common ances- 
tor of obligate intracellular pathogens of the genera 
Chlamydia and Chlamydophila. Gene gain via birth of new 
gene families prevailed on the lineage leading to the 
common ancestor of Lentisphaerae and Verrucomicrobia. 
This was followed by deceleration of gene birth and duplica- 
tion rates on the lineage leading to the common ancestor of 
Lentisphaerae. Subsequently, an accelerated gene duplication 



process occurring on the Lentisphaera araneosa lineage re- 
sulted in the high number of genes in multigene families in 
the L. araneosa genome, again pointing to the role of protein 
evolution after gene duplication. Gene gain contributed to 
genome content changes on several lineages of phylum 
Verrucomicrobia (Verrucomicrobium spinosum, Opitutaceae 
bacterium, and several ancestral lineages). Low rates of gene 
birth and duplication were detected on the lineage leading to 
the common ancestor of V. spinosum and A. muciniphiia, 
which may have played a role in the emergence of the rela- 
tively small genome of A. muciniphiia found in close associ- 
ation with the eukaryotic host. 

We also detected patterns consistent with a high degree of 
genome plasticity and possibly emergence of new phenotypes 
on several phylogenetic lineages (with elevated rates of pro- 
cesses bringing new genes into the genome [gene birth and 
HGT] and gene loss for the same lineage). We observed this 
pattern for a number of ancestral lineages within the 
super-phylum including the following: lineages leading to 
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last common ancestors of all PVC organisms; O. bacterium, 
Opitutus terrae, and Verrucomicrobiae bacterium; A. mucini- 
phila, V. spinosum, Chthoniobacter flavus, Methylacidiphilum 
infernorum, and Pedosphaera parvula; species belonging to 
order Planctomycetales. For two of those lineages, elevated 
HGT rates are also consistent with a highly chimerical nature 
of the emerging genome. 

This report summarizes for the first time genome-level pro- 
cesses that appear to have occurred on various lineages of the 
PVC super-phylum. Here, we determined that genome expan- 
sion has driven evolution within Planctomycetes, genome con- 
traction has been a main theme of genome-level changes 
within Chlamydiae, and the balance of these two processes 
has led to genomes of stable size in Verrucomicrobia and 
Lentisphaerae. We observed acceleration of gene birth and 
gene loss rates on the PVC evolutionary lineage, which sug- 
gests that genome content changed significantly in terms of 
present/absent genes, to produce the common ancestor of 
the super-phylum. This indicates that a number of new pheno- 
types might have emerged on this lineage. 

Horizontal Gene Transfer among PVC Organisms and 
Members of Other Bacterial Groups 

Lateral transfer is recognized as a major source of diversity in 
bacterial genomes (Jain et al. 2003). However, an accepted 
model of HGT which would allow prediction of the frequency 
of transfer events between a given pair of species does not 
currently exist. This frequency is known to depend on differ- 
ences in GC content and genome size, as well as ecological 
factors such as carbon source utilization and oxygen tolerance 
(Jain et al. 2003; Smillie et al. 201 1). Other factors that intui- 
tively should affect frequency of HGT between organisms in- 
clude differences in codon usage, di-, and tri-nucleotide 
composition, and divergence between donor and recipient 



organisms, including divergence of regulatory motifs and bio- 
molecular interaction interfaces. The exact mode of physical 
interaction in an ecosystem (whether organisms are physically 
attached to each other and how they might interact, versus 
rare or nonexistent physical contact) must also be critical to 
HGT. However, the modes of interaction between bacterial 
cells in the environment are hard to assess. To delineate the 
extent to which HGT contributes to PVC genome dynamics, 
we traced the number of transfer events between different 
organisms for every gene family within the data set (table 1, 
supplementary table S3 and fig. S12, Supplementary Material 
online). We then determined which lineages were overrepre- 
sented among donors of transferred genes for every lineage of 
the species tree. We used a simple binomial test with the 
probability of an organism (extant or ancestral) acting as an 
HGT donor being proportional to its genome size, but assum- 
ing that all transfer events were independent and all genomes 
had equal opportunity to contribute. For simplicity, we did not 
account for all the other factors affecting HGT frequency. 
Deviation from the random process should therefore be 
viewed as a sign that one of the factors listed above has af- 
fected HGT between the species. It appeared that HGT has 
preferentially occurred between organisms of the same 
phylum, which is attributable to the existence of fewer diver- 
gent regulatory mechanisms within a group of related species 
and general similarity in genome structure. It is also possible 
that lifestyle properties are not radically different between 
closely related bacterial taxa and therefore co-occurrence in 
the environment might also contribute. Three main HGT high- 
ways, all occurring within Planctomycetes, were detected 
within the PVC clade: a large number of sequences were 
transferred from P. staleyi to B. marina (149 events), and 
from P. maris and Planctomyces brasiliensis to P. limnophilus 
(120 and 146 events, respectively). The species in the list are 
referred to as "donor organisms" for simplicity; however, they 



Table 1 




Organisms (Extant or Ancestral) Frequently Acting as Donors in Lateral Transfer Events 


Recipient 


Donor (No. of Transfer Events); Only Organisms Frequently Acting as Donors Are 




Shown (P< 1e-8) 


Kuenenia stuttgartiensis 


Deferribacteres (18); Desulfovibrio vulgaris Miyazaki F (25); Geobacter lovleyi SZ (39); 




Synergistetes (1 5); HydrogenobacterlPersephonellalSulfurihydrogenibium (1 5) 


Gemmata/lsosphaera/Pirellulaceae/Planctomyces 


Candidatus Solibacter usitatus Ellin6076 (45) 


Gemmata obscuriglobus UQM 2246 


Candidatus S. usitatus Ellin6076 (49) 


Isosphaera pallida ATCC 43644 


Chloroflexus aggregans DSM 9485 (22) 


Victivallis vadensis ATCC BAA-548 


Spirochaeta sp. Buddy (21); Treponema azotonutricium ZAS-9 (22); D. vulgaris 




Miyazaki F (25) 


Methylacidiphilum infernorum V4 


a-Proteobacteria (22) 


Pedosphaera parvula Ellin514 


T. saanensis SP1PR4 (34); Candidatus S. usitatus Ellin6076 (78) 


Opitutaceae/Opitutus/Verrucomicrobiae 


Candidatus S. usitatus Ellin6076 (29) 


Verrucomicrobiae bacterium DG1235 


Candidatus S. usitatus Ellin6076 (38) 


Opitutus terrae PB90-1 


Candidatus S. usitatus Ellin6076 (81); D. vulgaris Miyazaki F (24) 


Chthoniobacter flavus Ellin428 


Sorangium cellulosum So ce 56 (60); Candidatus 5. usitatus Ellin6076 (63) 


Akkermansia muciniphila ATCC BAA-835 


Bacteroides fragilis NCTC 9343 (47); D. vulgaris Miyazaki F (17) 
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reflect merely the best guesses from currently available gen- 
omic sequences. Interpretation of the data assumes that life- 
style properties are not radically different between closely 
related bacterial taxa if the true donor is not included in the 
data set. 

There was also evidence for HGT events between phyla. 
Identification of HGT donors external to the PVC super- 
phylum could provide insights into the evolution of the recipi- 
ent's ecology and cell biology, which is especially relevant for 
these poorly characterized organisms. Although the ecological 
niche of V. spinosum is unknown, it has been proposed that 
this organism is capable of host associations (Sait et al. 201 1). 
According to our analysis, a number of genes in the V. spino- 
sum genome were derived from HGT events from organisms 
outside the super-phylum, including Candidatus Solibacter usi- 
tatus Ellin6076 (P value = 9e-07, 42 genes), Desulfovibrio 
vulgaris Miyazaki F (P value = 5e-06, 22 genes, including a 
number of genes encoding type III secretion), and Sorangium 
cellulosum So ce 56 (P value = 8.972e-04, 38 genes). 
Although these donor organisms are typically considered 
free-living, Desulfovibrio species are also known to be host 
associated, and in some cases pathogenic (Goldstein et al. 
2003). This suggests that V. spinosum might also exhibit a 
host-associated lifestyle; however, the mode of this interaction 
is unknown. In contrast, G. obscuriglobus, a free-living aquatic 
organism (Franzmann and Skerman 1984), also acquired 
(according to our computational analysis) a large number 
of genes via HGT from Candidatus 5. usitatus Ellin6076 
(P value <1e-08, 49 genes), Sor. cellulosum So ce 56 
(P value = 2e-07, 42 genes), and the ancestor of all alpha- 
proteobacteria (P value = 4.23e-05, 17 genes), but only 12 
genes from D. vulgaris Miyazaki F (Pvalue = 8.5e-03). 

In contrast to the quite high HGT activity for these (primar- 
ily) free-living lineages, we detected a very small number of 
HGT events involving chlamydial species, which highlights the 
consequences of the host-dependent lifestyle of Chlamydiae 
and the distinct evolutionary forces acting on their genomes. 
However, five HGT events were detected between the ances- 
tor of all Chlamydiae and the ancestor of spirochaetes 
(P value = 9.2 1e-06) and an additional five genes were 
acquired from the common ancestor of cyanobacteria 
(P value = 5.56e-06). 

Here, we characterized the contribution of HGT to PVC 
super-phylum evolution in a quantitative manner, and deter- 
mined that, according to our analysis, many genes were 
acquired by extant and ancestral PVC organisms from 
Acidobacteria (in particular from Candidatus 5. usitatus), 
delta Proteobacteria (in particular from Sor. cellulosum and 
D. vulgaris) and, in the case of A. muciniphila, from 
Bacteriodetes (in particular from Bacteroides fragilis NCTC 
9343, described later). Large numbers of observed transfer 
events between PVC species and other bacterial organisms 
presumably indicate close ecological interactions between 
bacteria in the environment. 



Gene Birth, Duplication, Transfer, and Loss in Different 
Biological Functions 

Major evolutionary events are observed at the level of the 
individual gene/protein but act in the context of broader cel- 
lular biology. We used KEGG (Kanehisa et al. 201 0) metabolic 
pathways to classify gene families and systematically identify 
pathways affected by gene birth, duplication, transfer, and 
loss processes. We elucidated overrepresented/underrepre- 
sented pathways among gene families affected by these 
events. We found that gene birth predominantly affected 
genes of unknown function (according to KEGG) and other 
pathways were significantly depleted; cluster 1 mostly 
contained lineages leading to extant species (supplementary 
fig. S13/\, Supplementary Material online). However, within 
cluster 1, there were some outliers, for instance cell motility 
genes were overrepresented among genes born on the lin- 
eage leading to the common ancestor of Chlamydia and 
Chlamydophila. Several ancestral lineages of different phyla 
form cluster 2, where some pathways were affected by 
gene birth to different extents. For example, pathways for 
metabolism of terpenoids, polyketides, and other secondary 
metabolites were enriched among genes born on lineages 
leading to the common ancestor of Lentisphaerae species 
and the Chthoniobacter/Methylacidiphilum/Akkermansia/ 
Verrucomicrobium phylogenetic lineage. Proteins involved in 
folding, sorting, and degradation were affected by gene birth 
on Chlamydiae, Lentisphaerae/Verrucomicrobia/Chlamydiae, 
and P. maris/P. brasiliensis lineages. Cluster 3 includes mostly 
ancestral lineages where genes involved in transcription and 
genes of unknown function were considerably affected by 
gene birth. 

In cases of HGT, we were able to distinguish three main 
clusters of phylogenetic lineages (supplementary fig. S13£, 
Supplementary Material online). In cluster 1, HGT did not 
affect genes of unknown function but instead influenced 
the majority of other pathways. In cluster 2 and 3, genes of 
unknown function, lipid metabolism, and xenobiotic biodeg- 
radation and metabolism were strongly affected by HGT. In 
cluster 2, genes involved in cell motility and signal transduction 
were affected, whereas in the majority of lineages from cluster 
3, membrane transport, glycan biosynthesis and metabolism, 
metabolism of other amino acids, and biosynthesis of other 
secondary metabolite pathways were influenced by HGT. 

We identified three major clusters of evolutionary lineages 
based on how genes in various functional categories changed 
via gene duplication on these lineages (supplementary fig. 
S13C, Supplementary Material online). On lineages from clus- 
ter 3, gene duplication mainly affected genes of unknown 
function. Other functional categories were significantly under- 
represented, with the exception of signal transduction and cell 
motility pathways on lineages leading to several extant or an- 
cestral lineages within Verrucomicrobia- and Planctomycetes- 
specific clades. In cluster 2, genes of unknown function were 
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significantly overrepresented among duplicated genes. How- 
ever, the remaining functional categories were not depleted 
but rather affected at average rate. In cluster 1, genes of un- 
known function were not affected by gene duplications, 
whereas a number of general metabolic and genetic informa- 
tion processing functions were. With regard to how gene loss 
influences various cellular functions on different evolutionary 
lineages within the PVC super-phylum, we distinguished three 
main groups of phylogenetic lineages (supplementary fig. 
S13D, Supplementary Material online). Cluster 2 mostly 
included extant and ancestral lineages of the Chlamydiae- 
specific clade; the most affected cellular functions in this 
cluster were amino acid metabolism and energy metabolism, 
which supports our current understanding of genome evolu- 
tion in phylum Chlamydiae (Kalman et al. 1999). 

Enrichment of unclassified genes (unassigned to specific 
pathways) among those influenced by particular evolutionary 
events, as well as the generally large number of unclassified 
gene families within PVC genomes, might indicate a large 
number of uncharacterized pathways evolved within PVC 
species. 

HGT Example: Carbohydrate Utilization in A. muciniphila 

Organisms inhabiting the same environment tend to have a 
large number of shared genes; this process is driven by adap- 
tation to the same ecological conditions of the habitat as well 
as greater opportunity for HGT due to close physical contact 
of cells. Consistent with this, a large number of transfer events 
were detected between A. muciniphila and B. fragilis (organ- 
isms associated with the human digestive tract), 47 genes, 
P value < 1e-08. Both of these organisms are capable of 
mucin degradation (Roberton and Stanley 1982; Derrien 
et al. 2004) and A. muciniphila is the only confirmed 
host-associated Verrucomicrobia organism included in our 
analysis; therefore, all determinants of this new lifestyle 
should be acquired on the A. muciniphila lineage. Horizontal 
acquisition of carbohydrate-binding peptidases (proteins 
implicated in mucin degradation) by A. muciniphila (and 
other PVC organisms) from Bacteroidetes species was recently 
reported (Nakjang et al. 2012). However, the evolutionary 
origin of pathways related to carbohydrate metabolism in A. 
muciniphila has not been described. To address this question 
and explore one example where evolutionary process affects 
molecular function but is acting on the level of particular 
genes, we performed a comprehensive phylogenomic analysis 
of A. muciniphila genes predicted to encode carbohydrate- 
active enzymes by CAZY classification (http://www.cazy.org, 
last accessed December 13, 2012) (Cantarel et al. 2009). 
There are four carbohydrate esterases (three families); 59 
glycoside hydrolases (26 families); and 44 glycoside transfer- 
ases (14 families) encoded in the genome of A. muciniphila. 
Forty-six out of 59 putative glycoside hydrolyses possess a 
predicted signal peptide (LipoP or SignalP prediction) and 



therefore might be potentially involved in mucin degradation 
outside the bacterial cell (supplementary table S4, Supplemen- 
tary Material online). 

We inferred the evolutionary origin of every glycoside hydro- 
lase and transferase gene using a phylogenomics framework. 
Phylum-level taxonomic classification of organisms from the 
closest sister clade was used to identify potential origin for 
every putative glycoside hydrolase or glycoside transferase in 
the A. muciniphila genome (fig. 3). Three groups of glycoside 
hydrolase families were observed when classified based on the 
taxonomic affiliation of phylogenetically closely related se- 
quences. Enzymes of the first group were vertically transmitted 
to A. muciniphila and group together with other sequences 
from verrucomicrobial genomes; they belong to families 98, 
97, 77, 3, and 1 8. Proteins from the second group were closely 
related to genes from Bacteroidetes species, glycoside hydro- 
lase families 1 10, 27, 13,57,31,43, 123, 105, 109, 29, 2, and 
20. The third group included glycoside hydrolase families rep- 
resented by 2 to 4 genes in A. muciniphila. One copy was 
affiliated with a protein from Bacteroidetes species, and the 
others either vertically transmitted to A. muciniphila, or 
shared an origin with proteins in Acidobacteria genomes. 
Glycoside hydrolase families of the fourth group were repre- 
sented by a few (1-3) proteins in the genome and according to 
our classification, closely related to proteins from organisms of 
different taxa or their ancestral species {Lentisphaerae, 
Actinobacteria, Acidobacteria, or Planctomycetes). Families of 
glycoside hydrolases known to be involved in mucin degrad- 
ation (families 27, 29, 2, 20, 95, 35, 36, and 33) (Turroni et al. 
201 0) were present in group number 2 and 3. This suggests a 
dual evolutionary origin of mucin degradation pathways in A. 
muciniphila, where genes in the pathway were obtained from 
two major donors, which also gave rise to sequences in 
Acidobacteria and Bacteroidetes. 

A different trend was observed in the case of glycoside 
transferases, with the main group of glycoside transferase 
families containing genes vertically inherited by A muciniphila. 
The second cluster contained glycoside transferases of varying 
taxonomic affiliation (distantly related Chloroflexi, Synergis- 
tetes, Proteobacteria, Bacteroidetes, or more closely related 
Planctomycetes). This pattern suggests vertical transition for 
the majority of genes involved in carbohydrate biosynthesis. 
The difference between glycoside transferases and glycoside 
hydrolases suggests that glycan utilization is a major mechan- 
ism of microbial survival within the human digestive system 
and highlights the importance of extracellular glycoside deg- 
radation for A. muciniphila fitness. 

Gene Birth Example: Proteins Containing the DUF1501 
and PSCyt Domains in Planctomycetes, Verrucomicrobia, 
and Lentisphaerae 

The discovery of cellular compartmentalization within first 
Planctomycetes, and subsequently Poribacteria and 
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Fig. 3. — Taxonomic assignment of sequences closely related to carbohydrate metabolism genes from Akkermansia muciniphila. (A) Glycoside hydro- 
lases; (B) Glycoside transferases. Heatmap chart was generated using ratios of genes with a certain taxonomic assignment to total number of genes assigned 
to a glycoside hydrolase or transferase family. Dendrograms on the sides of the heatmaps were generated using a hierarchical clustering algorithm. Clusters 
discussed in the main text are marked. Color key is present in the bottom-right corner. 



Verrucomicrobia, established the super-phylum as a model 
group for studying the evolution of biological complexity 
within bacteria (Fieseler et al. 2004; Fuerst 2005; Lee et al. 
2009). The identification of proteins necessary for the transi- 
tion from a simple bacterial cell plan to a more complex 
compartmentalized cellular organization is of great import- 
ance to evolutionary biology. Although some genes have 
been implicated in their role in cellular compartmentalization 
within the super-phylum (Santarella-Mellwig et al. 201 0), gen- 
etic determinants of this process are yet to be determined. 
Within this study, we identified several gene families which 
were detected primarily within the PVC super-phylum and 
asserted to have emerged on this evolutionary lineage (sup- 
plementary table S5, Supplementary Material online). These 
genes might be involved in compartment formation within 
PVC species, although they could also be present in the gen- 
omes due to species relatedness. Several identified gene 
families contained proteins previously reported to be 
PVC-specific membrane coat-like proteins (Santarella- 
Mellwig et al. 2010). The others contained protein domains 
of unknown function. One such domain is DUF1501 



(PF07394). DUF1 501 -containing proteins were present in 
Planctomycetes (I. pallida, G. obscuriglobus, Planctomyces, 
and Pirellula species), Verrucomicrobia (Par. parvula, V. spino- 
sum, C. flavus, and Coraliomargarita akajimensis), and 
Lentisphaerae (L. araneosa) genomes, as well as a few species 
of Bacteroidetes, Acidobacteria, and Proteobacteria (supple- 
mentary fig. S14, Supplementary Material online). All the or- 
ganisms in this list either have been shown to contain 
intracellular membranes or have not been analyzed in terms 
of their cellular structure. DUF1 501 -containing proteins are 
absent from the basal Planctomycetes species, Kuenenia stutt- 
gartiensis, which is known to possess intracellular membranes. 
This suggests that DUF1 501 is not related to the primary emer- 
gence of intracellular membranes within PVC but might be 
involved in processes that are associated with the intracellular 
membranes in more recently derived Planctomycetes and 
Verruconicrobia. It might also indicate a distinct genetic basis 
for compartmentalization in "anammox" species compared 
with organisms of the class Planctomycetia. The large number 
of DUF1 501 -containing proteins indicated that they are 
involved in a variety of related cellular processes and are 
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important for the organisms; therefore we carried out further 
characterization of DUF1 501 -containing proteins. 

To identify proteins or protein domains functionally related 
to DUF1501, we conducted analysis of the genomic neigh- 
borhood of all DUF1 501 -containing proteins. This analysis re- 
vealed the existence of genomic clusters of varying structure 
and domain composition (fig. 4 and supplementary fig. S14, 
Supplementary Material online). The proteins most strongly 
associated with DUF1 501 featured a cytochrome c motif con- 
taining domains PSCytl and PSCyt2 (PF07635 and PF07583) 
and domain of unknown function PSD1 (PF07587), previously 
identified to be specific for R. baltica (Studholme et al. 2004). 
Figure 4 shows the domain composition of several genomic 
clusters containing DUF1 501 proteins found in the genome of 
V. spinosum, present in a variety of other PVC species and 
showing clear grouping within the phylogenetic tree of 
DUF1 501 -containing proteins (supplementary fig. S14, Sup- 
plementary Material online). Some DUF1 501 -containing pro- 
teins were not associated with proteins bearing PSCyt or PSD1 
domains. Another genomic locus architecture included a 
DUF1 501 -containing protein (sometimes also carrying a 
twin-arginine motif or signal peptide) and a protein with 
PSCytl /PSCyt2/PSD1 architecture. The second protein in the 
genomic cluster sometimes contained additional Laminin_G_3 
or F5_F8_type_C domains classically involved in carbohydrate 
binding (Sharon and Lis 1972). The most complex gene clus- 
ters involved four genes with the following predicted domain 
architecture: 1 ) DUF1 501 , sometimes with twin-arginine signal 
peptide; 2) protein with weak support for one or several PPC 
domains normally found in secreted bacterial peptidases 
(Yeats et al. 2003), signal peptide, and conserved regions 
without characterized signatures; 3) PSCytl /Big_2/PSCyt2/ 
PSD1 protein; and 4) PSCytl /WD40 protein. Domains Big_2 
and WD40 are known to be involved in protein-protein inter- 
actions (Kelly et al. 1999; Xu and Min 2011) and probably 
are responsible for protein complex assembly or substrate rec- 
ognition. The correlation coefficient between the presence/ 
absence of intracellular membrane and the phylogenetic 



distribution of genomic loci of this structure was 0.852, indi- 
cating a strong association of this genetic module with intra- 
cellular membranes. Several organisms of undetermined 
cellular structure also encode the genomic clusters, suggesting 
that they might also possess internal membranes. 

The large number of genomic clusters, diversity of their 
domain composition, and high degree of evolutionary conser- 
vation of DUF1 501 -containing genomic loci suggests that 
DUF1501, PSCyt1/2, and PSD1 domain containing proteins 
are involved in a variety of related cellular processes and are 
important for the organisms. From an evolutionary stand- 
point, the topology of the large-scale phylogenetic tree for 
DUF1501 containing proteins shows clusters of DUF1501- 
bearing genes from the loci of similar domain architecture 
(supplementary fig. S14, Supplementary Material online). 
DUF1 501 -containing proteins are preferentially present 
within Planctomycetes, Verrucomicrobia, and Lentisphaerae. 
This suggests that this protein domain 1) has emerged on 
the lineage leading to the common ancestor of PVC organ- 
isms, 2) duplicated a number of times and acquired a func- 
tional relationship with a variety of proteins on the PVC 
evolutionary lineage, and 3) has been subsequently trans- 
mitted vertically through the PVC clade of the species tree 
with a number of loss events occurring within the clade. 
This implies that the majority of the observed domain arrange- 
ments for DUF1 501 -containing genomic loci have emerged 
on the PVC evolutionary lineage in a time of ~100 Myr, con- 
stituting a rapid phase of functional innovation within 
DUF1 501 -associated genetic modules. This is consistent with 
a currently accepted model of protein evolution where 
domain rearrangements play a major role in evolutionary 
adaptations (Bornberg-Bauer et al. 2010). Some domain ar- 
rangements appear to be reinvented within a number of gen- 
omic loci (supplementary fig. S14, Supplementary Material 
online), which is in agreement with the current view of con- 
vergent evolution of domain architectures being rather 
common (Forslund et al. 2008). 
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Fig. 4. — Domain structure of several different types of genomic loci including proteins that contain DUF1501, PSCyt1/2, and PSD1 domains in the 
Verrucomicrobium spinosum genome. Evolutionary relationships between several DUF1501 domain-containing proteins from the V. spinosum genome, 
representing genomic loci of different architecture conserved across several species, are shown. Gl numbers of corresponding V. spinosum proteins are used 
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Association of this system with intracellular membrane 
compartmentalization (within the class Planctomycetia and 
other PVC species) is additionally supported by the presence 
of canonically membrane-associated putative cytochromes 
and signal peptides within the described units. Additional 
intracellular membranes in this case might constitute alterna- 
tive destinations for canonically outer membrane-targeted 
proteins. The presence of a number of carbohydrate- and 
protein-binding units within PSCyt domain-carrying genes 
suggests that those genes encode outer membrane-asso- 
ciated (or according to our hypothesis, intracellular mem- 
brane-associated) cytochromes transferring electrons to 
specific acceptors (possibly proteins and sugars). At the 
same time, the presence of a twin-arginine signal peptide 
within DUF1 501 -bearing proteins points to an enzymatic 
function, which requires cofactors acquired in the cytoplasm 
of the cell (Lee et al. 2006) and carried out either in the 
periplasm (or possibly within additional intracellular compart- 
ments) or outside the cell. However, the function itself remains 
unclear. 

Overall, the described ensemble of genetic modules consti- 
tutes an interesting example of a novel molecular system 
within a number of PVC genomes. The high level of evolu- 
tionary conservation and the number of times the clusters 
have been duplicated in the genomes suggests an important 
and lineage-specific role of encoded proteins in the bacterial 
cell. It also implicates involvement of DUF1501 and PSCyt/ 
PSD-containing proteins in a large number of cellular pro- 
cesses. All these make this system an attractive target for 
future functional studies. Recent breakthroughs in developing 
genetic tools for Verrucomicrobia and Planctomycetes species 
(Domman et al. 201 1 ; Jogler et al. 201 1 ) provide opportunities 
to test hypotheses presented here. 

Conclusions 

Here, we uncovered several features of PVC genome evolu- 
tion. First, organisms in the super-phylum evolved primarily via 
genome shrinkage within the chlamydial clade, genome ex- 
pansion within the Planctomycete clade, and a balance of 
these two processes on the lineages within Verrucomicrobia 
and Lentisphaerae. Second, acquisition of novel genes (gene 
birth) has been a consistent characteristic of PVC genome 
evolution, but the rate of gene birth varies among lineages 
of the clade and the rate of gene acquisition has increased on 
a variety of recent Planctomycetes lineages and on the ances- 
tral PVC lineage. Third, large numbers of genes were acquired 
by extant and ancestral PVC organisms from Acidobacteria, 
delta Proteobacteria and, in the case of A. muciniphila, from 
Bacteroidetes. 

We also described an example of a complex molecular 
system that evolved within the PVC super-phylum. Consider- 
ing the predicted evolutionary origin, phylogenetic distribu- 
tion, and revealed evolutionary patterns, the system is likely 



to be involved in multiple processes and be uniquely important 
for PVC bacteria bearing those genes. The phylogenetic dis- 
tribution of the proteins seems to correlate nondeterministic- 
ally with the presence/absence of the additional intracellular 
membrane in Planctomycetes and Verrucomicrobia (correl- 
ation coefficient of 0.852 for the DUF1 501 -containing protein 
family from the most complex four-gene genomic cluster). 
Further study of these novel proteins might provide insights 
into the emergence of complex cellular structures within bac- 
teria. Ultimately, the combination of high-throughput analysis 
and detailed analysis of individual protein families constitutes a 
useful approach for understanding the evolution of bacterial 
genomes. 

Supplementary Material 

Supplementary figures S1-S14 and tables S1-S5 are available 
at Genome Biology and Evolution online (http://www.gbe. 
oxfordjournals.org/). 
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